%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import scipy
from scipy import stats
import h5py
import math
#this sets the size of the plot to something useful
plt.rcParams["figure.figsize"] = (10,7)
In this problem, I will be looking at the data from a gamma-ray satellite orbiting in low Earth orbit. It takes a reading of the number of particles detected every 100 milliseconds, and is in an approximately 90 minute orbit. While it is looking for gamma-ray bursts, virtually all of the particles detected are background cosmic rays.
As with most data, there are 'features.' My lab instructor has helpfully incorporated the meta-data into the data file.
First, I downloaded the data from the course website (gammaray_lab4.h5), and imported it into my working environment. I am using cloud python, so I imported this in the terminal using the command: "wget -O gammaray_lab4.h5 https://canvas.uw.edu/courses/1401649/files/67789336/download?wrap=1"
The data has 4 columns and more than 25 million rows. The columns are time (in gps seconds), Solar phase (deg) showing the position of the sun relative to the orbit, Earth longitude (deg) giving the position of the spacecraft relative to the ground, and particle counts.
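As a quick sanity check on these numbers (a standalone sketch; the 0.1 s cadence and 5400 s orbital period come from the description above, and the row count is approximate):

```python
# Sanity check on the dataset's scale, assuming a 0.1 s cadence,
# a 5400 s (90 minute) orbit, and roughly 25 million rows.
cadence = 0.1          # seconds between samples
period = 5400          # orbital period in seconds
n_rows = 25_000_000    # approximate number of records

samples_per_orbit = period / cadence
total_days = n_rows * cadence / 86400

print(f'Samples per orbit: {samples_per_orbit:.0f}')
print(f'Total span of data: {total_days:.1f} days')
```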
# Import file
hf = h5py.File('gammaray_lab4.h5', 'r')
# Import data into an array
data = np.array(hf.get('data'))
# Print the first record (time, Solar phase, Earth longitude, particle counts)
print(data[:,0])
hf.close()
[9.40680016e+08 3.15000000e+02 4.50000000e+01 1.00000000e+01]
Next, I will make a few plots to explore my data and make sure that I understand it. I will start by plotting the first 150k samples (rows) of time vs Solar phase, time vs Earth longitude, and time vs particle counts.
plt.plot(data[0, 0:150000], data[1, 0:150000], '-')
plt.xlabel('Time (gps s)', fontsize=15)
plt.ylabel('Solar phase (deg)', fontsize=15)
plt.tick_params(labelsize=15)
plt.show()
plt.plot(data[0, 0:150000], data[2, 0:150000], '-')
plt.xlabel('Time (gps s)', fontsize=15)
plt.ylabel('Earth phase (deg)', fontsize=15)
plt.tick_params(labelsize=15)
plt.show()
plt.plot(data[0, 0:150000:1500], data[3, 0:150000:1500], 'o')
plt.xlabel('Time (gps s)', fontsize=15)
plt.ylabel('Particle count', fontsize=15)
plt.tick_params(labelsize=15)
plt.show()
The time vs particle counts graph appears highly volatile, and plotting the first 150k points was not very informative. Let us instead plot only the first 100 points to see if there is any difference.
plt.plot(data[0, 0:100], data[3, 0:100], 'o')
plt.xlabel('Time (gps s)', fontsize=15)
plt.ylabel('Particle count', fontsize=15)
plt.tick_params(labelsize=15)
plt.show()
By plotting the particle counts of the first 150k data points (as a smear) and also just the first 100 data points, we can clearly see that the particle count values are not constant, but rather fluctuate randomly.
From these first three plots, I can see that both the Solar phase and the Earth phase increase with time and are periodic. Each repeats every 360 degrees, which corresponds to about every 5400 gps seconds (~90 minutes) on the time axis. This is expected from the given information (the full orbit has a period of 90 minutes).
The particle counts seem to vary with some distribution. I know that the number of particles measured must be discrete, so I have a choice between two main distributions for the background: binomial and Poisson. Since cosmic-ray backgrounds are typically modeled by the Poisson distribution, I will try building the background pdf() with a Poisson distribution.
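To see why Poisson is a natural choice here, a small standalone check (the `n` and `p` values below are illustrative, not taken from the data): when events are rare and the number of opportunities is large, the binomial distribution converges to a Poisson with $\lambda = np$.

```python
from scipy import stats

# Illustrative numbers: many "trials", tiny per-trial probability,
# giving lambda = n*p expected counts per interval.
n, p = 100_000, 8e-5
lam = n * p  # = 8.0

for k in (5, 8, 12):
    print(f'k={k}: binomial={stats.binom.pmf(k, n, p):.6f}, '
          f'poisson={stats.poisson.pmf(k, lam):.6f}')
```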
I'd like to explore how the background changes. I will first plot a smear of the data, to see how it changes over a larger period of time. The entire dataset is 25 million records; let's just take a look at 300k samples, but only every 3000 points. This should give us around 100 points on a scatter plot for each graph plotted.
def plot_smear(start, end, steps):
    plt.scatter(data[0, start:end:steps], data[1, start:end:steps], label = 'Time (gps s) vs Solar phase (deg)')
    plt.scatter(data[0, start:end:steps], data[2, start:end:steps], label = 'Time (gps s) vs Earth phase (deg)')
    plt.scatter(data[0, start:end:steps], data[3, start:end:steps], label = 'Time (gps s) vs Particle count (# particles)')
start = 0
end = 300000
steps = 3000
plot_smear(start, end, steps)
plt.legend()
plt.xlabel('Time (gps s)', fontsize=15)
plt.ylabel('Degrees or Particle Count', fontsize=15)
plt.tick_params(labelsize=15)
plt.show()
The Solar and Earth phases both repeat periodically over this larger range of data points. However, the phases drift closer together over time and switch places: the data begins with the Earth phase leading the Solar phase by about 80 degrees; roughly halfway through the 300k points they overlap; and by the end, the Solar phase leads the Earth phase by about 25 degrees. This makes sense, as the satellite revolves around the Earth while the Earth revolves around the Sun, so the Earth and Solar phases will not keep the same relationship over any significant amount of time. The entire dataset is 25 million data points, which at 100 ms per sample is 2.5 million gps seconds, or approximately 28.9 days of data. This pattern is therefore expected, and I will ignore it in constructing my model.
What I want to focus on is the particle counts, which contain my signal. I see some signal contamination: the highest particle counts are measured when both the Solar phase and the Earth phase are high. This could indicate a periodic shift in the average particle count value over time.
I would like to explore the relationship of the particle count with both the Solar and the Earth phase, which are both parts of the metadata that would have a direct influence on the particle count.
Let's plot the particle counts and the Solar phase or Earth phase data on 2d histograms to see if we can determine any clear patterns.
fig, ax = plt.subplots()
h = ax.hist2d(data[1, start:end], data[3, start:end], bins = 28)
fig.colorbar(h[3], ax = ax, label = 'Number of data points')
plt.xlabel('Solar phase (deg)', fontsize = 15)
plt.ylabel('particle counts (# particles)', fontsize = 15)
plt.tick_params(labelsize=15)
plt.show()
fig, ax = plt.subplots()
h = ax.hist2d(data[2, start:end], data[3, start:end], bins = 28)
fig.colorbar(h[3], ax = ax, label = 'Number of data points')
plt.xlabel('Earth phase (deg)', fontsize = 15)
plt.ylabel('particle counts (# particles)', fontsize = 15)
plt.tick_params(labelsize=15)
plt.show()
As we can see, the relationship between the Earth longitude (deg) and the particle counts is much clearer than that between the Solar phase and the particle counts. From this 2d histogram, it appears that the average particle count shifts from higher values to lower values in a periodic pattern. This is a much clearer sign of contamination of the data by the different Earth phases. The average appears to shift from around 10 down to around 5 every period.
Next, I will try to create a model for the background that includes time dependence, and then explicitly compare my model to the data.
I will start by shifting the 2d histogram of Earth longitude vs particle counts so that I can approximate an average particle count function, which I will use to construct the Poisson-distributed background. I did this iteratively, and found that a shift of 315 degrees forward (315/360 = 87.5% of the full period) lets me see the entire period of the Earth phase without the jump:
fig, ax = plt.subplots()
h = ax.hist2d((data[2, start:end] - 315) % 360, data[3, start:end], bins = 28)
fig.colorbar(h[3], ax = ax, label = 'Number of data points')
plt.xlabel('Earth phase (deg)', fontsize = 15)
plt.ylabel('particle counts (# particles)', fontsize = 15)
plt.tick_params(labelsize=15)
plt.show()
With this folded 2d histogram, it is much clearer that the average has a maximum of about 10 particles and a minimum of about 5 particles. The average particle count gradually decreases from ~10 to ~5, then jumps suddenly back to ~10 and repeats.
These are the parameters I will start with to build my background pdf().
How do I include time dependence in my model so that this Poisson distribution can accurately capture the entire dataset, with its changing average particle counts? The actual fluctuations in the average particle count resemble a decaying exponential. I can try to capture this pattern in my model (over the first period of data only). First I will build the average particle count function, then use it to generate the random variates, and finally plot those alongside the actual data to see how well the background model overlaps with it.
start_model = 0
earth_long_model = np.linspace(start_model, 360, 100) # one full cycle, 100 total steps
lambda_model = 6*np.exp(-earth_long_model/50) + 5.9
bkgd_model = stats.poisson.rvs(lambda_model, size=100)
fig, ax = plt.subplots()
h = ax.hist2d((data[2, start:end] - 315) % 360, data[3, start:end], bins = 28)
ax.plot(earth_long_model, bkgd_model, color = 'w', label = 'Poisson Background model')
ax.plot(earth_long_model, lambda_model, linewidth = 6, color = 'r', label = 'Average particle count')
fig.colorbar(h[3], ax = ax, label = 'Number of data points')
plt.xlabel('Earth phase (deg)', fontsize = 15)
plt.ylabel('particle counts (# particles)', fontsize = 15)
plt.tick_params(labelsize=15)
ax.legend()
plt.show()
plt.plot(earth_long_model, bkgd_model, color = 'k', label = 'Poisson Background model')
plt.plot((data[2, start:end:steps] - 315) % 360, data[3, start:end:steps], 'o', color = 'r', label = 'Actual Data')
plt.xlabel('Earth phase (deg)', fontsize = 15)
plt.ylabel('particle counts (# particles)', fontsize = 15)
plt.tick_params(labelsize=15)
plt.legend()
plt.show()
After plotting the Poisson-distributed random variables (with $\lambda$ time dependent) along with the smeared data on the same graph, I observe that my model looks very close to the actual data of the particle counts. (The smeared graph means that a few outliers are also plotted, but the majority of the particle count values for any Earth phase tends to gather close to the model.)
The equation for the model's parameter lambda that is used in the Poisson distribution is found iteratively (visually on the plot) and is given by the following, where the Earth longitude is given by $\phi$: $$\lambda(\phi) = 6 e^{-\phi/50} + 5.9$$
And the Poisson background model probability distribution of x particle counts is given by $$ Prob(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
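As a standalone check that this formula is what `scipy` evaluates, the closed form can be compared against `stats.poisson.pmf` (the $\lambda$ value here is just an example):

```python
import math
from scipy import stats

lam = 8.34  # example average particle count
for x in (0, 5, 10):
    closed_form = lam**x * math.exp(-lam) / math.factorial(x)
    print(f'x={x}: closed form={closed_form:.8f}, scipy={stats.poisson.pmf(x, lam):.8f}')
```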
This model, though based on $\phi$ (Earth longitude in degrees), is time-dependent because the Earth phase can be derived from time. The second plot I constructed in Part 1 shows this relationship. I will now convert this into a relationship between any time (in gps seconds) and Earth longitude (degrees).
For convenience, here is the initial Earth phase and the initial time in GPS seconds.
print(f'The initial Earth phase is {data[2,0]} degrees.')
print(f'The initial time in gps seconds is {data[0,0]}.')
The initial Earth phase is 45.0 degrees.
The initial time in gps seconds is 940680016.0.
Recall that I shifted the data ahead by 315 degrees in order to build my background pdf model. The initial Earth phase is 45 degrees, so I need to account for this in my model. The Earth phase at any point will be the initial phase, plus some phase difference that depends on the time.
The Earth phase repeats every 5400 seconds, so to get the fraction of the Earth phase that any data point is at, we will use the modulus operator and the period of the orbit, $T_{orbit} = $ 5400 gps s. After adding the initial phase, we will also further mod this expression by 360 to ensure that the Earth phase always remains within 0 and 360.
Thus the Earth phase at any point in time t can be found by the following equation:
$$\phi(t) = \left(\phi_0 + 360 \cdot \frac{(t-t_{0}) \,\%\, T_{orbit}}{T_{orbit}}\right) \% \, 360$$
where $\phi_0$ = 45.0, $t_0$ = 940680016.0, and $T_{orbit}$ = 5400.
Thus the equation relating the Poisson average particle count parameter $\lambda$ to time is given by: $$\lambda(t) = 6 \, e^{-\phi(t)/50} + 5.9$$ where $\phi(t)$ is the Earth phase equation above.
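These two equations can be sketched as standalone functions (using the constants $\phi_0$ = 45.0, $t_0$ = 940680016.0, and $T_{orbit}$ = 5400 from above) and checked for the expected periodicity:

```python
import numpy as np

PHI0, T0, T_ORBIT = 45.0, 940680016.0, 5400.0

def earth_phase(t):
    """Earth longitude (deg) at gps time t, from the equation above."""
    return (PHI0 + 360.0 * ((t - T0) % T_ORBIT) / T_ORBIT) % 360.0

def lam_model(t):
    """Model average particle count at gps time t."""
    return 6.0 * np.exp(-earth_phase(t) / 50.0) + 5.9

# The phase (and hence lambda) repeats every full orbit:
print(earth_phase(T0), earth_phase(T0 + T_ORBIT))  # 45.0 45.0
print(lam_model(T0), lam_model(T0 + T_ORBIT))      # equal values
```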
Because my background distribution is time-dependent, the discovery sensitivity threshold (how many particles I would need to see) also varies. I would like to now explore what would be the "5-sigma" threshold for a 100 millisecond GRB at different times.
I have done this in previous labs. The general steps will be the following. First, I will take the 5-sigma threshold and convert it into a probability using the survival function on a normal distribution of mean 0 and standard deviation 1. Next, I will choose a few different times: 0 gps seconds, 1000 gps seconds, and 5400 gps seconds from the start time. That way, we can take a look at multiple points in the same period.
My statistical question is: "What signal would I need to measure in order for my background to have a probability corresponding to 5 sigma to produce this signal, at t = 0, 1000, and 5400 gps seconds after the initial time?"
Then, for each of these times, I will generate the background Poisson distribution with the corresponding parameter $\lambda$ given by the above equation. Finally, I will use the inverse survival function to calculate the number of particles the background would need to produce to reach 5 sigma at each of those times.
sigma = 5
prob = stats.norm.sf(sigma)
print(f'The probability given by a 5 sigma event is {prob}.')
t_a = data[0,0] + 0
t_b = data[0,0] + 1000
t_c = data[0,0] + 5400
def calc_lambda(t):
    phi = (data[2,0] + 360*((t - data[0,0]) % 5400) / 5400) % 360
    return 6*np.exp(-phi/50) + 5.9
lambda1 = calc_lambda(t_a)
lambda2 = calc_lambda(t_b)
lambda3 = calc_lambda(t_c)
The probability given by a 5 sigma event is 2.866515718791933e-07.
Let's plot the three different background model probability distributions for these 3 different times.
t = []
p_1 = []
p_2 = []
p_3 = []
for i in range(0, 30):
    t.append(i)
    p_1.append(stats.poisson.pmf(i, lambda1))
    p_2.append(stats.poisson.pmf(i, lambda2))
    p_3.append(stats.poisson.pmf(i, lambda3))
plt.plot(t, p_1, label = f'Poisson background pmf for time = {t_a - data[0,0]} after t_start')
plt.plot(t, p_2, label = f'Poisson background pmf for time = {t_b - data[0,0]} after t_start')
plt.plot(t, p_3, 'o', label = f'Poisson background pmf for time = {t_c - data[0,0]} after t_start')
plt.xlabel('Particle counts (# particles)', fontsize = 15)
plt.ylabel('Probability', fontsize = 15)
plt.tick_params(labelsize=15)
plt.legend()
plt.show()
After the start time, the Earth phase increases linearly, corresponding to a decreasing lambda parameter, as seen in the plot for t = 1000 seconds after the start time. For a Poisson distribution, a decreasing lambda parameter corresponds to the average value (peak) shifting to the left. A 5 sigma event will require fewer particles as the peak shifts to the left. Therefore, within one cycle, as time passes the Earth phase increases, the average particle count decreases, the Poisson distribution's peak shifts left, and the number of particles required for a 5 sigma event decreases.
However, I also observe that after one full period, t = 5400 gps seconds after the start time, my Poisson distribution should be identical to that after t = 0 gps seconds after the start time. This makes sense as my parameter lambda is dependent on a periodic variable, phi.
Let us calculate the number of particles that would answer my statistical question, and report on our findings.
num_part1 = stats.poisson.isf(prob, lambda1)
num_part2 = stats.poisson.isf(prob, lambda2)
num_part3 = stats.poisson.isf(prob, lambda3)
print(f'The parameter lambda at time {t_a - data[0,0]} gps s after the start time is {lambda1}.')
print(f'The number of particles required to observe a 5 sigma event is {num_part1} for {t_a-data[0,0]:.0f} gps s after the start time.')
print(f'The parameter lambda at time {t_b - data[0,0]} gps s after the start time is {lambda2}.')
print(f'The number of particles required to observe a 5 sigma event is {num_part2} for {t_b-data[0,0]:.0f} gps s after the start time.')
print(f'The parameter lambda at time {t_c - data[0,0]} gps s after the start time is {lambda3}.')
print(f'The number of particles required to observe a 5 sigma event is {num_part3} for {t_c-data[0,0]:.0f} gps s after the start time.')
The parameter lambda at time 0.0 gps s after the start time is 8.339417958443594.
The number of particles required to observe a 5 sigma event is 26.0 for 0 gps s after the start time.
The parameter lambda at time 1000.0 gps s after the start time is 6.5430235925138405.
The number of particles required to observe a 5 sigma event is 23.0 for 1000 gps s after the start time.
The parameter lambda at time 5400.0 gps s after the start time is 8.339417958443594.
The number of particles required to observe a 5 sigma event is 26.0 for 5400 gps s after the start time.
I can summarize this in a table:
| gps s after start time | lambda | num particles for 5 sigma event |
|---|---|---|
| 0 | 8.34 | 26 |
| 1000 | 6.54 | 23 |
| 5400 | 8.34 | 26 |
In this problem I am going to look at a stack of telescope images (again simulated). We have 10 images, but my lab partner and I will be looking for different signals. William will be looking for the faintest stars, while I will be looking for a transient (something like a supernova that appears in only one image).
First I will download the data from images.h5. This is a stack of 10 square images, each 200 pixels on a side.
# Import file
hf = h5py.File('images.h5', 'r')
# Import data into an array
imagestack = np.array(hf.get('imagestack'))
image1 = np.array(hf.get('image1'))
hf.close()
I will start by exploring the data, as I did before for Problem 1. Let's first take a look at all ten images.
# First adjust the figure size to something useful
plt.rcParams["figure.figsize"] = (15,40)
for i in range(0, 10):
    plt.subplot(5, 2, i+1)
    plt.title(f'Image Number {i+1}', fontsize=15)
    plt.xlabel('Spatial direction x (pixels)', fontsize=15)
    plt.ylabel('Spatial direction y (pixels)', fontsize=15)
    plt.imshow(imagestack[:,:,i])
    plt.tick_params(labelsize = 15)
plt.show()
Here, the telescope images are of the same section of the sky. There are several pixels throughout the image which are bright yellow, while others are blue, and some dark blue, and most of them a deep navy. All of the images look more or less the same to the human eye.
Let's take a look at a handful of the first few pixel values in image 1. I'll look at the first 2x2 pixels (origin begins at top left corner).
print(imagestack[0,0,0])
print(imagestack[0,1,0])
print(imagestack[1,0,0])
print(imagestack[1,1,0])
-0.5049209253540916
0.3134736734559387
-0.2528844391465269
-0.29488745324834853
Each pixel in the image has a value which represents its intensity. A higher pixel value means a brighter pixel.
Let me now sum the images together, and also plot their average separately, to see if I can spot any signal contamination.
plt.rcParams["figure.figsize"] = (15,10)
w, h = 200, 200
sum_image = [[0 for x in range(w)] for y in range(h)]
for i in range(0, 200):
    for j in range(0, 200):
        for k in range(0, 10):
            sum_image[i][j] += imagestack[i,j,k]
plt.imshow(sum_image)
plt.title('Sum of all 10 images', fontsize = 20)
plt.xlabel('Spatial direction x (pixel)', fontsize = 15)
plt.ylabel('Spatial direction y (pixel)', fontsize = 15)
plt.tick_params(labelsize=15)
plt.show()
w, h = 200, 200
avg_image = [[0 for x in range(w)] for y in range(h)]
for i in range(0, 200):
    for j in range(0, 200):
        for k in range(0, 10):
            avg_image[i][j] += imagestack[i,j,k]
        avg_image[i][j] = avg_image[i][j]/10
plt.imshow(avg_image)
plt.title('Average of all 10 images', fontsize = 20)
plt.xlabel('Spatial direction x (pixel)', fontsize = 15)
plt.ylabel('Spatial direction y (pixel)', fontsize = 15)
plt.tick_params(labelsize=15)
plt.show()
I observe no strange defects that would suggest signal contamination. All the pixel intensities and positions look predictable and consistent, so I conclude there is no signal contamination.
Since stars emit light and generally do not move in the sky (with respect to Earth), we will assume that the stars in our images are bright pixels which are constant. However, there are other factors that may have an impact on the intensity: noise and transients.
The intensity of each pixel is made up of the light from stars, the background noise, and any transients. Thus, $$I_{pixel} = I_{stars} + I_{noise} + I_{transients} $$
I would like to construct a model for the background. The stars contribute an intensity term which is constant across each image, and the transients will contribute intensity terms for only one image (each transient event). If I subtract one image from another, I will remove the effect of the stars, while keeping the background noise and the transients. I can then set a threshold above which transients are classified. Let me state now that intensities which would result in an event of 5 $\sigma$ or greater will be considered transients in my model.
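A toy demonstration of this idea, on synthetic numbers rather than the lab data (the star position, noise level, and transient brightness below are all assumed): subtracting two images cancels the constant star term while preserving a transient.

```python
import numpy as np

rng = np.random.default_rng(0)
stars = np.zeros((50, 50))
stars[10, 10] = 20.0                             # a bright, constant star

img_a = stars + rng.normal(0, 0.77, (50, 50))    # stars + noise
img_b = stars + rng.normal(0, 0.77, (50, 50))
img_b[30, 30] += 10.0                            # a transient in the second image only

diff = img_b - img_a                             # stars cancel; noise and transient remain
print(f'difference at star pixel:      {diff[10, 10]:+.2f}')  # ~0, just noise
print(f'difference at transient pixel: {diff[30, 30]:+.2f}')  # ~+10
```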
Let me construct the background, then hunt for transients in part 3. First let me see the result of subtracting image 2 from image 1:
w, h = 200, 200
diff12 = [[0 for x in range(w)] for y in range(h)]
for i in range(0, 200):
    for j in range(0, 200):
        diff12[i][j] = imagestack[i,j,0] - imagestack[i,j,1]
plt.imshow(diff12)
plt.title('Background noise + Transients', fontsize = 20)
plt.xlabel('Spatial direction x (pixel)', fontsize = 15)
plt.ylabel('Spatial direction y (pixel)', fontsize = 15)
plt.tick_params(labelsize=15)
plt.show()
To construct the background probability distribution function, I will consider only the background noise. The signal that a transient exists is a 5 sigma or greater event. Thus, I will construct a histogram using the pixel intensity values (flattened into a 1D array, then binned) of the difference between two images, say Images 1 and 2 which were just plotted. Then I will plot this histogram on a semi-log graph and try fitting different PDFs to this distribution to see what would model this background noise best.
flat1 = np.reshape(diff12, 200*200)
fig, ax = plt.subplots(1, 1)
ax.hist(flat1, 100, density=True)
plt.tick_params(labelsize = 15)
plt.xlabel('Intensity', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.title('Binned Intensities', fontsize = 20)
plt.show()
flat1 = np.reshape(diff12, 200*200)
fig, ax = plt.subplots(1, 1)
ax.hist(flat1, 100, density=True)
plt.tick_params(labelsize = 15)
plt.yscale('log')
plt.xlabel('Intensity', fontsize=15)
plt.ylabel('Probability Density (log scale)', fontsize=15)
plt.title('Binned Intensities Semi Log', fontsize = 20)
plt.show()
flat1 = np.reshape(diff12, 200*200)
fig, ax = plt.subplots(1, 1)
ax.hist(flat1, 100, density=True, label = 'Binned intensities')
x = np.linspace(-3,3,1000)
plt.plot(x, stats.norm.pdf(x, scale = 1/1.3), linewidth = 3, label = 'Gaussian fit')
plt.tick_params(labelsize = 15)
plt.yscale('log')
plt.xlabel('Intensity', fontsize=15)
plt.ylabel('Probability Density (log scale)', fontsize=15)
plt.legend(fontsize = 15)
plt.title('Binned Intensities Semi Log with Gaussian Fit', fontsize = 20)
plt.show()
On a semi-log graph, the data points around Intensity = 0 follow an upside-down parabola, characteristic of a Gaussian distribution. The fit is very good with a scale (sigma) parameter of 1/1.3 ≈ 0.7692. The mean of the distribution is 0, as is the Gaussian fit's mean. The Gaussian pdf that fits the background noise is approximately the following, where x = Intensity: $$pdf_{background} = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}$$
where $\sigma$ = 0.7692 and $\mu$ = 0
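As a standalone check (using the fitted $\sigma$ = 0.7692 from above): for a zero-mean Gaussian background, the 5-sigma intensity threshold is simply $5\sigma$, and `scipy`'s pdf matches the closed form just quoted.

```python
import math
from scipy import stats

sigma_bkg = 0.7692
prob = stats.norm.sf(5)                            # one-sided 5-sigma tail probability
threshold = stats.norm.isf(prob, loc=0, scale=sigma_bkg)
print(threshold, 5 * sigma_bkg)                    # both ~3.846

x = 1.0                                            # example intensity
closed = math.exp(-0.5 * (x / sigma_bkg)**2) / (sigma_bkg * math.sqrt(2 * math.pi))
print(closed, stats.norm.pdf(x, scale=sigma_bkg))
```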
Finally, using my background distribution, I will hunt for transients across all 10 images. I will first take 5 sigma and convert it into a probability, then generate a threshold (intensity value) above which intensity values will be considered representative of transients.
sigma = 5
prob = stats.norm.sf(sigma)
print(f'The probability given by a 5 sigma event is {prob}.')
st_dev_model = 0.7692
max_intensity = stats.norm.isf(prob, loc = 0, scale = st_dev_model)
print(f'Pixels with intensities over {max_intensity:.2f} will be considered transients.')
The probability given by a 5 sigma event is 2.866515718791933e-07. Pixels with intensities over 3.85 will be considered transients.
def transient_search(first, second): # inputs are image indices into the stack
    # take the difference to remove the stars; this keeps the background noise (radiation) and the transients
    diff = imagestack[:,:,first] - imagestack[:,:,second]
    num_transients = 0
    for i in range(0, 200):
        for j in range(0, 200):
            if diff[i][j] > max_intensity:
                num_transients += 1
    return num_transients
print(transient_search(0,1))
print(transient_search(1,2))
print(transient_search(2,3))
print(transient_search(3,4))
print(transient_search(4,5))
print(transient_search(5,6))
print(transient_search(6,7))
print(transient_search(7,8))
print(transient_search(8,9))
print(transient_search(9,0))
total_transients = 0
for i in range(0, 9):
    total_transients += transient_search(i, i+1)
total_transients += transient_search(9, 0)
print(f'I found {total_transients} transients.')
0
0
1
0
0
1
0
0
0
1
I found 3 transients.
By taking the difference of each pair of images (including the wrap-around pair of images 10 and 1), I removed the stars and kept the background noise and the transients. I counted only positive excursions: a transient in image k appears as a positive excursion in exactly one of the ten circular differences (and as a negative excursion in one other), so summing the positive detections counts each transient exactly once. I found 3 transients across all 10 images: bright pixels with intensity $\geq$ 3.85 that were not constant between images.
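The counting scheme can be checked on a toy stack (synthetic values, not the lab images): a transient placed in a single image is detected exactly once when only positive excursions over the ten circular differences are counted.

```python
import numpy as np

stack = np.zeros((1, 1, 10))     # a 1x1-pixel, 10-image toy stack
stack[0, 0, 2] = 10.0            # one transient, in image index 2
threshold = 3.85                 # the 5-sigma threshold from above

pairs = [(i, i + 1) for i in range(9)] + [(9, 0)]
count = sum(int(((stack[:, :, a] - stack[:, :, b]) > threshold).sum())
            for a, b in pairs)
print(count)   # 1
```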
My lab partner and I had different background pdf()s, but we were using the same data because we were searching for different signals and asking different statistical questions. My lab partner was searching for faint stars while I was searching for transients. Thus, we had to come up with different methods to get the "background" which is appropriate for answering our respective questions. The background is defined as the signal-free data.
My background PDF was created by looking at the background noise and transients only, as I removed all of the stars, and my measured signal was a pixel that had intensity brighter than some threshold value and only existed for one image.
As for my lab partner, he had to remove transients from his background and in doing so only looked at the background noise and the stars.